BUG/PERF: Series.combine_first converting int64 to float64 #51777


Merged

Conversation

lukemanley
Member

@lukemanley lukemanley added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Dtype Conversions Unexpected or buggy dtype conversions labels Mar 4, 2023
@jessestone7

If the solution to this issue is to convert floats back to ints after they get converted to floats, that will not work correctly for some ints. For example:

>>> s1 = pd.Series([1666880195890293744, 1666880195890293837])
>>>
>>> s1
0    1666880195890293744
1    1666880195890293837
dtype: int64
>>>
>>> s1.astype(float).astype(int)
0    1666880195890293760
1    1666880195890293760
dtype: int64

It would be good to include big ints like these in the tests.
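The precision loss above is a property of the float64 format: its 53-bit significand cannot represent every int64 exactly, so distinct large integers can collapse to the same value on a round trip. A minimal NumPy sketch of the boundary (values here are illustrative):

```python
import numpy as np

# int64 values above 2**53 are not all exactly representable as float64:
# the 53-bit significand makes adjacent large ints round to the same float.
big = np.array([1666880195890293744, 1666880195890293837], dtype=np.int64)
roundtripped = big.astype(np.float64).astype(np.int64)
assert roundtripped[0] == roundtripped[1]  # both collapse to one value

# At or below 2**53 the int -> float -> int round trip is exact.
small = np.array([2**53 - 1, 12345], dtype=np.int64)
assert (small.astype(np.float64).astype(np.int64) == small).all()
```

This is also why `np.iinfo(np.int64).min` and `max` round-trip cleanly, as noted later in the thread: both happen to be exactly representable as float64.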

@@ -3272,7 +3274,13 @@ def combine_first(self, other) -> Series:
         if this.dtype.kind == "M" and other.dtype.kind != "M":
             other = to_datetime(other)
 
-        return this.where(notna(this), other)
+        combined = this.where(notna(this), other)
Member
Not sure whether this will break anything, but for the Series case there might be a proper fix: you could set a compatible fill_value for the reindex ops. The only disadvantage is that you'll have to compute notna(this) somehow.

@lukemanley
Member Author

lukemanley commented Mar 4, 2023

If the solution to this issue is to convert floats back to ints when they get converted to floats, that will not work correctly for some ints

Great catch, thanks! I've updated the PR to handle this and updated the test.

This also provides a nice perf improvement:

import pandas as pd
import numpy as np

N = 1_000_000

s1 = pd.Series(np.random.randint(0, N, N), dtype="int64")
s1 = s1.iloc[:-5]

s2 = pd.Series(np.random.randint(0, N, N), dtype="int64")
s2 = s2.iloc[5:]

%timeit s1.combine_first(s2)

# 59.4 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)     -> main
# 1.37 ms ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  -> PR

And I'll note the perf improvement is not specific to the integer case; here is float64:

import pandas as pd
import numpy as np

N = 1_000_000

s1 = pd.Series(np.random.randn(N), dtype="float64")
s1 = s1.iloc[:-5]

s2 = pd.Series(np.random.randn(N), dtype="float64")
s2 = s2.iloc[5:]

%timeit s1.combine_first(s2)

# 48.4 ms ± 169 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)      -> main
# 1.64 ms ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  -> PR

@lukemanley lukemanley added the Performance Memory or execution speed performance label Mar 4, 2023
@lukemanley lukemanley changed the title BUG: Series.combine_first converting int64 to float64 BUG/PERF: Series.combine_first converting int64 to float64 Mar 4, 2023
-        return this.where(notna(this), other)
+        combined = concat([this, other]).reindex(new_index, copy=False)
+        combined.name = self.name
+        return combined
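The concat-and-reindex approach in the diff above can be sketched in user code. This is a simplified illustration (all names and data are made up, and it skips the null-handling in `self` that the full patch adds): instead of writing NaN placeholders via `where`, it takes `this` as-is, appends only the rows of `other` whose labels `this` lacks, and reindexes to the combined index, so no dtype-changing NaN fill ever happens.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"], dtype="int64")
s2 = pd.Series([3, 4, 5], index=["b", "c", "d"], dtype="int64")

new_index = s1.index.union(s2.index)

# Rows of s2 whose labels are absent from s1; s1 wins on shared labels,
# matching combine_first semantics (for non-null s1 values).
other = s2.reindex(s2.index.difference(s1.index))
combined = pd.concat([s1, other]).reindex(new_index)

assert combined.dtype == "int64"  # no float64 upcast
assert combined.to_dict() == {"a": 1, "b": 2, "c": 4, "d": 5}
```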
Member

This needs a finalise call with self, otherwise we will lose metadata.

This might change result ordering? Can we do a reindex at the end?

Member Author

Updated, thanks

this = self
null_mask = isna(this)
if null_mask.any():
drop = this.index[null_mask].intersection(other.index[notna(other)])
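A small illustration of what the `drop` computation in the snippet above collects (example data is made up): the labels where `this` is null but `other` has a non-null value. Dropping those rows from `this` before concatenating lets `other` supply them, which is exactly combine_first's "fill nulls from other" rule, without ever writing NaN into an integer array.

```python
import pandas as pd
import numpy as np

this = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])
other = pd.Series([10.0, 20.0], index=["b", "c"])

null_mask = this.isna()
# Labels that are null in `this` AND filled by a non-null value in `other`:
drop = this.index[null_mask].intersection(other.index[other.notna()])

assert list(drop) == ["b"]  # "b" is null in `this` and present in `other`
```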
Member

Can you add a quick comment on what this is doing? It took me a while to figure out, and I don't want to do it again in 6 months' time :)

Otherwise lgtm

Member Author

I've simplified this logic a bit; it should be easier to follow now.

@jessestone7

It looks like you are using these big integers in the tests to test for loss of precision:

>>> s1 = Series([np.iinfo(np.int64).min, np.iinfo(np.int64).max])

But note that there is no loss in going from int to float and back to int for these particular numbers:

>>> s1
0   -9223372036854775808
1    9223372036854775807
dtype: int64

>>> s1.astype(float).astype(int)
0   -9223372036854775808
1    9223372036854775807
dtype: int64

It would be good to use ints in the tests where loss of precision actually could happen, e.g.:

>>> s1 = pd.Series([1666880195890293744, 1666880195890293837])

>>> s1
0    1666880195890293744
1    1666880195890293837
dtype: int64

>>> s1.astype(float).astype(int)
0    1666880195890293760
1    1666880195890293760
dtype: int64

@lukemanley
Member Author

Updated, thanks @jessestone7

@mroeschke mroeschke merged commit 78947dd into pandas-dev:main Mar 10, 2023
@mroeschke
Member

Thanks again @lukemanley

@lukemanley lukemanley deleted the series-combine-first-preserve-dtype branch March 17, 2023 22:04
@jessestone7

I just tried this on pandas version 2.2.2 and I see that there is still a loss of precision:

>>> a = pd.Series([1666880195890293744, 5]).to_frame()
>>> b = pd.Series([6, 7, 8]).to_frame()
>>> a
                     0
0  1666880195890293744
1                    5
>>> b
   0
0  6
1  7
2  8
>>> a.combine_first(b)
                     0
0  1666880195890293760
1                    5
2                    8

Successfully merging this pull request may close these issues.

BUG: preserve the dtype on Series.combine_first